    Doublet method for very fast autocoding

    BACKGROUND: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulate rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding. METHODS: An autocoder was written that transforms plain text into intercalated word doublets (e.g. "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Matching doublets are assigned a numeric code specific for each doublet found in the nomenclature. Text doublets that do not match the index of doublets extracted from the nomenclature cannot be part of valid nomenclature terms. Runs of matching doublets from the text are concatenated and matched against nomenclature terms (also represented as runs of doublets). RESULTS: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both were tested on an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing 102,271 unique names of neoplasms). In a side-by-side comparison on the same computer, the doublet autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder. CONCLUSIONS: The doublet method is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.
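
    The core of the method can be sketched compactly in Perl, the language of the published implementation. The sketch below is illustrative rather than the article's autocoder: the toy nomenclature, its concept code, and the run-checking loop are simplified stand-ins for an index built from neocl.xml.

        #!/usr/bin/perl
        # Illustrative sketch of the doublet method, not the published autocoder.
        use strict;
        use warnings;

        # Toy nomenclature of term => code pairs; the code is hypothetical.
        my %nomenclature = ("ciliary body melanoma" => "NEO12345");

        # Index every word doublet occurring in any nomenclature term.
        my %doublet_index;
        for my $term (keys %nomenclature) {
            my @w = split ' ', $term;
            for my $i (0 .. $#w - 1) {
                $doublet_index{"$w[$i] $w[$i+1]"} = 1;
            }
        }

        # Collect runs of doublets known to the nomenclature, then test
        # the subsequences of each run against whole nomenclature terms.
        my @words = split ' ', lc "the ciliary body melanoma was resected";
        my @run;
        for my $i (0 .. $#words - 1) {
            if ($doublet_index{"$words[$i] $words[$i+1]"}) {
                push @run, $i;
            } else {
                check_run(\@run, \@words);
                @run = ();
            }
        }
        check_run(\@run, \@words);   # prints: ciliary body melanoma -> NEO12345

        sub check_run {
            my ($run, $words) = @_;
            return unless @$run;
            my ($lo, $hi) = ($run->[0], $run->[-1] + 1);
            for my $a ($lo .. $hi - 1) {
                for my $b ($a + 1 .. $hi) {
                    my $candidate = join ' ', @{$words}[$a .. $b];
                    print "$candidate -> $nomenclature{$candidate}\n"
                        if exists $nomenclature{$candidate};
                }
            }
        }

    Because any doublet absent from the index immediately terminates a run, most of the text is discarded after a single hash lookup per doublet, which is what makes the method fast.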

    Automatic extraction of candidate nomenclature terms using the doublet method

    BACKGROUND: New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature, and curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can then be reviewed by a curator to determine whether they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature. RESULTS: A 31+ Megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing an abstract title. The total number of words included in the abstract titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with a CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded 285 terms approved by a curator, and automatic removal of duplicate terms yielded a final list of 222 new terms (71% of the original 313 extracted candidates) that could be added to the reference nomenclature. CONCLUSION: The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms in vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
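
    Extraction differs from autocoding mainly in what is kept: a maximal run of known doublets that does not itself match a whole nomenclature term becomes a candidate. A minimal sketch, under the same assumptions and data structures as the autocoding sketch above:

        # Illustrative sketch; %doublet_index and %nomenclature are built
        # exactly as in the autocoding example above.
        sub extract_candidates {
            my ($words, $doublet_index, $nomenclature) = @_;
            my (@run, @candidates);
            for my $i (0 .. $#$words) {
                # The last word starts no doublet, so it always ends a run.
                if ($i < $#$words
                    && $doublet_index->{"$$words[$i] $$words[$i+1]"}) {
                    push @run, $i;
                    next;
                }
                if (@run) {
                    # Keep a run only if it is not already a whole term.
                    my $phrase = join ' ', @{$words}[$run[0] .. $run[-1] + 1];
                    push @candidates, $phrase
                        unless exists $nomenclature->{$phrase};
                    @run = ();
                }
            }
            return @candidates;
        }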

    Maternal immune activation and strain specific interactions in the development of autism-like behaviors in mice.

    It is becoming increasingly apparent that autism spectrum disorders (ASD) arise from both genetic and environmental factors. Animal studies provide important translational models for elucidating specific genetic or environmental factors that contribute to ASD-related behavioral deficits. For example, mouse research has demonstrated a link between maternal immune activation and the expression of ASD-like behaviors. Although these studies have provided insights into the potential causes of ASD, they are limited in their ability to model the important interactions between genetic variability and environmental insults. This is of particular concern given the broad spectrum of severity observed in the human population, which suggests that subpopulations may be more susceptible to the adverse effects of particular environmental insults. We hypothesized that the severity of the effects of maternal immune activation on ASD-like phenotypes is influenced by genetic background in mice. To test this, pregnant dams of two inbred strains (C57BL/6J and BTBR T(+)tf/J) were exposed to the viral mimic polyinosinic-polycytidylic acid (polyI:C), and their offspring were tested for the presence and severity of ASD-like behaviors. To identify differences in immune system regulation, spleens were processed and measured for alterations in induced cytokine responses. Strain-treatment interactions were observed in social approach, ultrasonic vocalization, repetitive grooming and marble-burying behaviors. Interestingly, persistent dysregulation of adaptive immune system function was observed only in BTBR mice. The data suggest that the behavioral and immunological effects of maternal immune activation are strain-dependent in mice.

    Tumor taxonomy for the developmental lineage classification of neoplasms

    BACKGROUND: The new "Developmental lineage classification of neoplasms" was described in a prior publication. The classification is simple (the entire hierarchy is described with just 39 classifiers), comprehensive (providing a place for every tumor of man), and consistent with recent attempts to characterize tumors by cytogenetic and molecular features. A taxonomy is a list of the instances that populate a classification. The taxonomy of neoplasia attempts to list every known term for every known tumor of man. METHODS: The taxonomy provides each concept with a unique code and groups synonymous terms under the same concept. A Perl script validated successive drafts of the taxonomy, ensuring that: 1) each term occurs only once in the taxonomy; 2) each term occurs in only one tumor class; 3) each concept code occurs in one and only one hierarchical position in the classification; and 4) the file containing the classification and taxonomy is a well-formed XML (eXtensible Markup Language) document. RESULTS: The taxonomy currently contains 122,632 different terms encompassing 5,376 neoplasm concepts. Each concept has, on average, 23 synonyms. The taxonomy populates "The developmental lineage classification of neoplasms" and is available as an XML file, currently 9+ Megabytes in length. A representation of the classification/taxonomy listing each term, followed by its code and its full ancestry, is available as a flat file, 19+ Megabytes in length. The taxonomy is the largest nomenclature of neoplasms, with more than twice the number of neoplasm names found in other medical nomenclatures, including the 2004 version of the Unified Medical Language System, the Systematized Nomenclature of Medicine Clinical Terms, the National Cancer Institute's Thesaurus, and the International Classification of Diseases for Oncology. CONCLUSIONS: This manuscript describes a comprehensive taxonomy of neoplasia that collects synonymous terms under a unique code number and assigns each tumor to a single class within the tumor hierarchy. The entire classification and taxonomy are available as open access files (in XML and flat-file formats) with this article.
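
    Three of the four validation checks lend themselves to a short Perl sketch. The well-formedness test below uses the standard XML::Parser module; the one-term-per-line format and the nci-code attribute are hypothetical stand-ins for whatever neocl.xml actually uses, and check 3 (unique hierarchical position of each concept code) is omitted for brevity.

        #!/usr/bin/perl
        # Illustrative draft-validation checks; the element and attribute
        # names are hypothetical, and check 3 is omitted for brevity.
        use strict;
        use warnings;
        use XML::Parser;

        my $file = "neocl.xml";

        # Check 4: the file must be well-formed XML (parsefile dies if not).
        eval { XML::Parser->new->parsefile($file) };
        die "Not well-formed XML: $@" if $@;

        # Checks 1 and 2: each term occurs once, under only one concept code.
        my (%seen, %code_of);
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while (<$fh>) {
            next unless /<name nci-code="([^"]+)">([^<]+)<\/name>/;
            my ($code, $term) = ($1, lc $2);
            warn "Duplicate term: $term\n" if $seen{$term}++;
            warn "Term in two classes: $term\n"
                if defined $code_of{$term} && $code_of{$term} ne $code;
            $code_of{$term} = $code;
        }
        close $fh;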

    The development of common data elements for a multi-institute prostate cancer tissue bank: The Cooperative Prostate Cancer Tissue Resource (CPCTR) experience

    BACKGROUND: The Cooperative Prostate Cancer Tissue Resource (CPCTR) is a consortium of four geographically dispersed institutions funded by the U.S. National Cancer Institute (NCI) to provide clinically annotated prostate cancer tissue samples to researchers. To facilitate this effort, it was critical to arrive at agreed-upon common data elements (CDEs) that could be used to collect demographic, pathologic, treatment and clinical outcome data. METHODS: The CPCTR investigators convened a CDE curation subcommittee to develop and implement CDEs for the annotation of collected prostate tissues. The draft CDEs were refined and progressively annotated to make them ISO 11179 compliant. The CDEs were implemented in the CPCTR database and tested using software query tools developed by the investigators. RESULTS: By collaborative consensus, the CPCTR CDE subcommittee developed 145 data elements to annotate the tissue samples collected. These included, for each case: 1) demographic data; 2) clinical history; 3) pathology specimen level elements describing the staging, grading and other characteristics of individual surgical pathology cases; 4) tissue block level annotation critical to managing a virtual inventory of cases and facilitating case selection; and 5) clinical outcome data, including treatment, recurrence and vital status. These elements have been used successfully to respond to over 60 requests by end-users for tissue, including paraffin blocks from cases with 5 to 10 years of follow-up, tissue microarrays (TMAs), and frozen tissue collected prospectively for genomic profiling and genetic studies. The CPCTR CDEs have been fully implemented in two major tissue banks and have been shared with dozens of other tissue banking efforts. CONCLUSION: The freely available CDEs developed by the CPCTR are robust, based on "best practices" for tissue resources, and ISO 11179 compliant. The process for CDE development described in this manuscript provides a framework for other organ sites and has been used as a model for breast and melanoma tissue banking efforts.
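
    As a concrete, entirely hypothetical illustration of what one ISO 11179-style element might carry, consider the sketch below; the actual CPCTR element names, definitions and permissible values may differ.

        # Hypothetical sketch of one ISO 11179-style CDE record; the real
        # CPCTR elements may be named and structured differently.
        my %cde = (
            name               => "Gleason Sum, Radical Prostatectomy",
            definition         => "Sum of the primary and secondary Gleason "
                                . "patterns in the prostatectomy specimen.",
            data_type          => "integer",
            permissible_values => [ 2 .. 10 ],
            category           => "pathology specimen level",
        );

    Structured definitions and enumerated permissible values are what allow query tools to treat annotations uniformly across geographically dispersed sites.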

    Classifying the precancers: A metadata approach

    BACKGROUND: During carcinogenesis, precancers are the morphologically identifiable lesions that precede invasive cancers. In theory, the successful treatment of precancers would result in the eradication of most human cancers. Despite the importance of these lesions, there has been no effort to list and classify all of the precancers. The purpose of this study is to describe the first comprehensive taxonomy and classification of the precancers. As a novel approach to disease classification, terms and classes were annotated with metadata (data that describes the data) so that the classification could be used to link precancer terms to data elements in other biological databases. METHODS: Terms in the UMLS (Unified Medical Language System) related to precancers were extracted. The extracted terms were reviewed, and additional terms were added. Each precancer was assigned one of six general classes. The entire classification was assembled as an XML (eXtensible Markup Language) file. A Perl script converted the XML file into a browser-viewable HTML (HyperText Markup Language) file. RESULTS: The classification contained 4,700 precancer terms, 568 distinct precancer concepts and six precancer classes: 1) acquired microscopic precancers; 2) acquired large lesions with microscopic atypia; 3) precursor lesions occurring with inherited hyperplastic syndromes that progress to cancer; 4) acquired diffuse hyperplasias and diffuse metaplasias; 5) currently unclassified entities; and 6) superclass and modifiers. CONCLUSION: This work represents the first attempt to create a comprehensive listing of the precancers, the first attempt to classify precancers by their biological properties, and the first attempt to create a pathologic classification of precancers using standard metadata (XML). The classification is placed in the public domain; the authors invite comment and are prepared to curate and modify the classification.
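
    The XML-to-HTML step can be sketched with the standard XML::Parser module, as below. The element layout, the "code" attribute and the file name are hypothetical, and the published script may render the hierarchy differently.

        #!/usr/bin/perl
        # Illustrative XML-to-HTML rendering; element names, the "code"
        # attribute and the file name are hypothetical.
        use strict;
        use warnings;
        use XML::Parser;

        my $depth = 0;
        my $parser = XML::Parser->new(Handlers => {
            Start => sub {
                my (undef, $element, %attr) = @_;
                # Render each nested class or term as an indented list item.
                print '  ' x $depth, "<li>$element",
                      ($attr{code} ? " ($attr{code})" : ''), "<ul>\n";
                $depth++;
            },
            End => sub {
                $depth--;
                print '  ' x $depth, "</ul></li>\n";
            },
        });

        print "<html><body><ul>\n";
        $parser->parsefile('precancer_classification.xml');
        print "</ul></body></html>\n";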

    The Lippmann–Schwinger Formula and One Dimensional Models with Dirac Delta Interactions

    We show how a proper use of the Lippmann–Schwinger equation simplifies the calculations needed to obtain scattering states for one-dimensional systems perturbed by N Dirac delta interactions. We consider two situations. In the first, attractive Dirac deltas perturb the free one-dimensional Schrödinger Hamiltonian; we obtain explicit expressions for scattering and Gamow states. For completeness, we show that the method for obtaining bound states uses comparable formulas, although not based on the Lippmann–Schwinger equation. In the second, the N attractive deltas perturb the one-dimensional Salpeter Hamiltonian; we again obtain explicit expressions for the scattering wave functions. Here we need regularisation techniques, which we implement via heat kernel regularisation.
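
    The simplification the abstract refers to can be stated compactly in the Schrödinger case. The notation below is ours, not the article's, and sign and normalization conventions should be checked against the paper.

        % For V(x) = -\sum_j a_j \delta(x - x_j), with E = \hbar^2 k^2 / 2m
        % and the outgoing free Green's function
        % G_0(x,x') = -(i m / \hbar^2 k) e^{i k |x - x'|},
        % the Lippmann-Schwinger equation reads
        \[
          \psi(x) = e^{ikx}
                  + \frac{i m}{\hbar^{2} k}
                    \sum_{j=1}^{N} a_j \, e^{ik|x - x_j|} \, \psi(x_j) .
        \]
        % Evaluating at x = x_l, l = 1, \dots, N, closes the equation on the
        % finitely many unknowns \psi(x_j):
        \[
          \psi(x_l) - \frac{i m}{\hbar^{2} k}
          \sum_{j=1}^{N} a_j \, e^{ik|x_l - x_j|} \, \psi(x_j) = e^{ikx_l},
          \qquad l = 1, \dots, N,
        \]
        % an N x N linear system from which the scattering state follows.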

    An informatics model for tissue banks – Lessons learned from the Cooperative Prostate Cancer Tissue Resource

    BACKGROUND: Advances in molecular biology and growing requirements from biomarker validation studies have generated a need for tissue banks to provide quality-controlled tissue samples with standardized clinical annotation. The NCI Cooperative Prostate Cancer Tissue Resource (CPCTR) is a distributed tissue bank that comprises four academic centers and provides thousands of clinically annotated prostate cancer specimens to researchers. Here we describe the CPCTR information management system architecture, common data element (CDE) development, query interfaces, data curation, and quality control. METHODS: Data managers review the medical records to collect and continuously update information for the 145 clinical, pathological and inventorial CDEs that the Resource maintains for each case. An Access-based data entry tool provides de-identification and a standard communication mechanism between each group and a central CPCTR database. Standardized automated quality control audits have been implemented. Centrally, an Oracle database has web interfaces that allow multiple user types, including the general public, to mine de-identified information from all of the sites at three levels of specificity and granularity, as well as to request tissues through a formal letter of intent. RESULTS: Since July 2003, CPCTR has offered over 6,000 cases (38,000 blocks) of highly characterized prostate cancer biospecimens, including several tissue microarrays (TMA). The Resource developed a website with interfaces for the general public as well as for researchers and internal members. These user groups have used the web tools to query summary data on available cases, to prepare requests, and to receive tissues. As of December 2005, the Resource had received over 130 tissue requests, of which 45 had been reviewed, approved and filled. Additionally, the Resource implemented the TMA Data Exchange Specification in its TMA program and created a computer program for calculating PSA recurrence. CONCLUSION: Building a biorepository infrastructure that meets today's research needs requires time and the input of many individuals from diverse disciplines. The CPCTR can provide large volumes of carefully annotated prostate tissue for research initiatives such as Specialized Programs of Research Excellence (SPOREs) and for biomarker validation studies, and its experience can guide the development of collaborative, large-scale, virtual tissue banks in other organ systems.
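
    The abstract mentions a program for calculating PSA recurrence. The sketch below is a hypothetical simplification, not the CPCTR program: it applies one common definition of biochemical recurrence (two consecutive post-prostatectomy PSA values of at least 0.2 ng/mL), and the function name and record layout are ours.

        # Hypothetical simplification, not the CPCTR program: flag biochemical
        # recurrence as two consecutive post-prostatectomy PSA >= 0.2 ng/mL.
        sub psa_recurrence_date {
            my @followup = @_;   # list of [date, psa_ng_per_ml], in date order
            for my $i (0 .. $#followup - 1) {
                return $followup[$i][0]   # date of first qualifying value
                    if $followup[$i][1] >= 0.2 && $followup[$i + 1][1] >= 0.2;
            }
            return undef;                 # no recurrence documented
        }

        my $date = psa_recurrence_date(
            [ "2004-03-01", 0.05 ],
            [ "2004-09-01", 0.3  ],
            [ "2005-03-01", 0.6  ],
        );
        my $msg = $date ? "Recurrence on $date\n" : "No recurrence documented\n";
        print $msg;   # prints: Recurrence on 2004-09-01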

    Development and evaluation of an open source software tool for deidentification of pathology reports

    BACKGROUND: Electronic medical records, including pathology reports, are often used for research purposes. Currently, there are few freely available programs that remove identifiers while leaving the remainder of the pathology report text intact. Our goal was to produce an open source, Health Insurance Portability and Accountability Act (HIPAA) compliant, deidentification tool tailored for pathology reports. We designed a three-step process for removing potential identifiers. The first step looks for identifiers known to be associated with the patient, such as name, medical record number, and pathology accession number. Next, a series of pattern matches look for predictable patterns likely to represent identifying data, such as dates, accession numbers and addresses, as well as patient, institution and physician names. Finally, individual words are compared with a database of proper names and geographic locations. Pathology reports from three institutions were used to design and test the algorithms. The software was improved iteratively on training sets until it exhibited good performance, and 1,800 new pathology reports were then processed. Each report was reviewed manually before and after deidentification to catalog all identifiers and note those that were not removed. RESULTS: 1,254 (69.7%) of the 1,800 pathology reports contained identifiers in the body of the report. 3,439 (98.3%) of the 3,499 unique identifiers in the test set were removed. Only 19 HIPAA-specified identifiers (mainly consult accession numbers and misspelled names) were missed. Of the 41 non-HIPAA identifiers missed, the majority were partial institutional addresses and ages. Outside consultation reports typically contain numerous identifiers and were the most challenging to deidentify comprehensively. Performance varied among reports from the three institutions, highlighting the need for site-specific customization, which is easily accomplished with our tool. CONCLUSION: We have demonstrated that it is possible to create an open source deidentification program that performs well on free-text pathology reports.
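
    The second step, pattern matching against predictable identifier formats, is the easiest to sketch. The regular expressions below are simplified stand-ins for illustration, not the published tool's actual patterns.

        # Illustrative pattern-matching step (step two); these regexes are
        # simplified stand-ins, not the published tool's patterns.
        use strict;
        use warnings;

        my @patterns = (
            qr/\b\d{1,2}[\/-]\d{1,2}[\/-]\d{2,4}\b/,    # numeric dates
            qr/\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2},?\s+\d{4}\b/,
            qr/\b[A-Z]{1,3}-?\d{2}-?\d{3,6}\b/,         # accession-like tokens
            qr/\b(?:Dr|Mr|Mrs|Ms)\.?\s+[A-Z][a-z]+\b/,  # title plus surname
        );

        sub scrub {
            my ($text) = @_;
            $text =~ s/$_/**REMOVED**/g for @patterns;
            return $text;
        }

        print scrub("Seen by Dr. Smith on 3/14/2004, case S04-12345.");
        # prints: Seen by **REMOVED** on **REMOVED**, case **REMOVED**.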